{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "83406f43",
   "metadata": {},
   "source": [
    "### P027 乳腺癌数据集 - 加载并认识数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6d4964b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a0c704ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "breast_cancer = load_breast_cancer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6ec66a9a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".. _breast_cancer_dataset:\n",
      "\n",
      "Breast cancer wisconsin (diagnostic) dataset\n",
      "--------------------------------------------\n",
      "\n",
      "**Data Set Characteristics:**\n",
      "\n",
      "    :Number of Instances: 569\n",
      "\n",
      "    :Number of Attributes: 30 numeric, predictive attributes and the class\n",
      "\n",
      "    :Attribute Information:\n",
      "        - radius (mean of distances from center to points on the perimeter)\n",
      "        - texture (standard deviation of gray-scale values)\n",
      "        - perimeter\n",
      "        - area\n",
      "        - smoothness (local variation in radius lengths)\n",
      "        - compactness (perimeter^2 / area - 1.0)\n",
      "        - concavity (severity of concave portions of the contour)\n",
      "        - concave points (number of concave portions of the contour)\n",
      "        - symmetry\n",
      "        - fractal dimension (\"coastline approximation\" - 1)\n",
      "\n",
      "        The mean, standard error, and \"worst\" or largest (mean of the three\n",
      "        worst/largest values) of these features were computed for each image,\n",
      "        resulting in 30 features.  For instance, field 0 is Mean Radius, field\n",
      "        10 is Radius SE, field 20 is Worst Radius.\n",
      "\n",
      "        - class:\n",
      "                - WDBC-Malignant\n",
      "                - WDBC-Benign\n",
      "\n",
      "    :Summary Statistics:\n",
      "\n",
      "    ===================================== ====== ======\n",
      "                                           Min    Max\n",
      "    ===================================== ====== ======\n",
      "    radius (mean):                        6.981  28.11\n",
      "    texture (mean):                       9.71   39.28\n",
      "    perimeter (mean):                     43.79  188.5\n",
      "    area (mean):                          143.5  2501.0\n",
      "    smoothness (mean):                    0.053  0.163\n",
      "    compactness (mean):                   0.019  0.345\n",
      "    concavity (mean):                     0.0    0.427\n",
      "    concave points (mean):                0.0    0.201\n",
      "    symmetry (mean):                      0.106  0.304\n",
      "    fractal dimension (mean):             0.05   0.097\n",
      "    radius (standard error):              0.112  2.873\n",
      "    texture (standard error):             0.36   4.885\n",
      "    perimeter (standard error):           0.757  21.98\n",
      "    area (standard error):                6.802  542.2\n",
      "    smoothness (standard error):          0.002  0.031\n",
      "    compactness (standard error):         0.002  0.135\n",
      "    concavity (standard error):           0.0    0.396\n",
      "    concave points (standard error):      0.0    0.053\n",
      "    symmetry (standard error):            0.008  0.079\n",
      "    fractal dimension (standard error):   0.001  0.03\n",
      "    radius (worst):                       7.93   36.04\n",
      "    texture (worst):                      12.02  49.54\n",
      "    perimeter (worst):                    50.41  251.2\n",
      "    area (worst):                         185.2  4254.0\n",
      "    smoothness (worst):                   0.071  0.223\n",
      "    compactness (worst):                  0.027  1.058\n",
      "    concavity (worst):                    0.0    1.252\n",
      "    concave points (worst):               0.0    0.291\n",
      "    symmetry (worst):                     0.156  0.664\n",
      "    fractal dimension (worst):            0.055  0.208\n",
      "    ===================================== ====== ======\n",
      "\n",
      "    :Missing Attribute Values: None\n",
      "\n",
      "    :Class Distribution: 212 - Malignant, 357 - Benign\n",
      "\n",
      "    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian\n",
      "\n",
      "    :Donor: Nick Street\n",
      "\n",
      "    :Date: November, 1995\n",
      "\n",
      "This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.\n",
      "https://goo.gl/U2Uwz2\n",
      "\n",
      "Features are computed from a digitized image of a fine needle\n",
      "aspirate (FNA) of a breast mass.  They describe\n",
      "characteristics of the cell nuclei present in the image.\n",
      "\n",
      "Separating plane described above was obtained using\n",
      "Multisurface Method-Tree (MSM-T) [K. P. Bennett, \"Decision Tree\n",
      "Construction Via Linear Programming.\" Proceedings of the 4th\n",
      "Midwest Artificial Intelligence and Cognitive Science Society,\n",
      "pp. 97-101, 1992], a classification method which uses linear\n",
      "programming to construct a decision tree.  Relevant features\n",
      "were selected using an exhaustive search in the space of 1-4\n",
      "features and 1-3 separating planes.\n",
      "\n",
      "The actual linear program used to obtain the separating plane\n",
      "in the 3-dimensional space is that described in:\n",
      "[K. P. Bennett and O. L. Mangasarian: \"Robust Linear\n",
      "Programming Discrimination of Two Linearly Inseparable Sets\",\n",
      "Optimization Methods and Software 1, 1992, 23-34].\n",
      "\n",
      "This database is also available through the UW CS ftp server:\n",
      "\n",
      "ftp ftp.cs.wisc.edu\n",
      "cd math-prog/cpo-dataset/machine-learn/WDBC/\n",
      "\n",
      ".. topic:: References\n",
      "\n",
      "   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction \n",
      "     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on \n",
      "     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,\n",
      "     San Jose, CA, 1993.\n",
      "   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and \n",
      "     prognosis via linear programming. Operations Research, 43(4), pages 570-577, \n",
      "     July-August 1995.\n",
      "   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques\n",
      "     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) \n",
      "     163-171.\n"
     ]
    }
   ],
   "source": [
    "print(breast_cancer[\"DESCR\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ab52a8c",
   "metadata": {},
   "source": [
    "### P028 查看data和target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "119b0cd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = breast_cancer[\"data\"]\n",
    "target = breast_cancer[\"target\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2fd967ac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569, 30)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "968b6016",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569,)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1b2c3987",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(numpy.ndarray, numpy.ndarray)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(data), type(target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "76ec87b9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,\n",
       "        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,\n",
       "        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,\n",
       "        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,\n",
       "        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],\n",
       "       [2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,\n",
       "        8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,\n",
       "        3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,\n",
       "        1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,\n",
       "        1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02],\n",
       "       [1.969e+01, 2.125e+01, 1.300e+02, 1.203e+03, 1.096e-01, 1.599e-01,\n",
       "        1.974e-01, 1.279e-01, 2.069e-01, 5.999e-02, 7.456e-01, 7.869e-01,\n",
       "        4.585e+00, 9.403e+01, 6.150e-03, 4.006e-02, 3.832e-02, 2.058e-02,\n",
       "        2.250e-02, 4.571e-03, 2.357e+01, 2.553e+01, 1.525e+02, 1.709e+03,\n",
       "        1.444e-01, 4.245e-01, 4.504e-01, 2.430e-01, 3.613e-01, 8.758e-02]])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8b878d81",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "133e0f2b",
   "metadata": {},
   "source": [
    "### P029 合并data和target到大数组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "df875ea7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "68b181c7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569, 30)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8d2f3c02",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569,)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "5cab2113",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_datas = np.c_[data, target]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7b677947",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,\n",
       "       8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,\n",
       "       3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,\n",
       "       1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,\n",
       "       1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02,\n",
       "       0.000e+00])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_datas[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a8e701a",
   "metadata": {},
   "source": [
    "### P030 乳腺癌数据集 - 生成Pandas的df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "3525cda7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "12c4225e",
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'pd' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-1-fe91903a6417>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m df = pd.DataFrame(\n\u001b[0m\u001b[0;32m      2\u001b[0m     \u001b[0mdata\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mall_datas\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m     columns=list(breast_cancer['feature_names']) + ['target'])\n\u001b[0;32m      4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      5\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mNameError\u001b[0m: name 'pd' is not defined"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    data=all_datas, \n",
    "    columns=list(breast_cancer['feature_names']) + ['target'])\n",
    "\n",
    "print(df.head(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8784cdb",
   "metadata": {},
   "source": [
    "### P031 乳腺癌数据集 - 拆分训练集和测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "d37f6bd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "c483fa9c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569, 30)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "58a3d3b2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(569,)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "60458543",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data, target, random_state=40, test_size=0.25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b8466b1d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((426, 30), (426,), (143, 30), (143,))"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape, y_train.shape, X_test.shape, y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b14fde64",
   "metadata": {},
   "source": [
    "### P032 乳腺癌数据集 - 训练测试集数据分布"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a588af8d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((569,), (426,), (143,))"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target.shape, y_train.shape, y_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "f86e5fd3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "target\n",
      "1    0.627417\n",
      "0    0.372583\n",
      "dtype: float64\n",
      "y_train\n",
      "1    0.607981\n",
      "0    0.392019\n",
      "dtype: float64\n",
      "y_test\n",
      "1    0.685315\n",
      "0    0.314685\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "for name, array in zip([\"target\", \"y_train\", \"y_test\"], [target, y_train, y_test]):\n",
    "    print(name)\n",
    "    print(pd.Series(array).value_counts(normalize=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a328105",
   "metadata": {},
   "source": [
    "### P033 乳腺癌数据集 - 训练测试集的均匀拆分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "d9049c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data, target, random_state=40, test_size=0.25, stratify=target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "94da01a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "target\n",
      "1    0.627417\n",
      "0    0.372583\n",
      "dtype: float64\n",
      "y_train\n",
      "1    0.626761\n",
      "0    0.373239\n",
      "dtype: float64\n",
      "y_test\n",
      "1    0.629371\n",
      "0    0.370629\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "for name, array in zip([\"target\", \"y_train\", \"y_test\"], [target, y_train, y_test]):\n",
    "    print(name)\n",
    "    print(pd.Series(array).value_counts(normalize=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd24a060",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
